# Replication Data & Code

Replication data and code for the paper: 'The Changing Economics of Knowledge Production' by Simona Abis and Laura Veldkamp.

## Data sources to replicate all main Tables and Figures: 
- Lightcast (formerly Burning Glass Technologies, BGT) data. This data is proprietary. We acquired the structured dataset, the full textual content of job postings, and a sample of time-to-fill data. The version of the data used for this project was obtained in 2018. The data can be purchased from Lightcast.

 
- Payscale data. This data is proprietary. Payscale shared respondent-level data with us. To the best of our knowledge, this data is not formally for sale, but Payscale can be contacted for inquiries.


- Public datasets (JOLTS rates; EDGAR and Preqin indices, risk-free rate).

## Pseudo data:
The cleaning and sub-setting of proprietary datasets is described in the paper and the Internet Appendix.
The following pseudo datasets are included: 
- *categorized_jobs_pseudo.csv*: an anonymized sample of BGT job-level data categorized as AI, OldTech or DataMgmt. 
  - In creating this pseudo dataset:
    - BGT IDs (BGTJobId) were substituted with an index of continuous IDs. 
    - The job date was substituted with a randomly assigned date among the dates in the final sample.
    - Dummy variables were replaced with randomized binary variables, subject to the constraint that every included observation is classified as AI, OldTech, or DataMgmt. 
    - The Sector variables were substituted with a random assignment to the 4 sectors present in the final dataset. 
    - An equal number of jobs was assigned to randomly generated Employer_IDs.
    - The number of employers and overall number of observations were chosen to be of comparable size to those in the original data.
  - When the code is run on the full dataset, this file should contain the information for every BGTJobId of interest (i.e., finance-related jobs, posted by employers of interest, that belong to one of the three categories above: AI, OldTech, or DataMgmt).


- *salaries_pseudo.csv*: time series of salary data per job type, replaced with randomly generated values drawn from a uniform distribution between zero and one. Salaries are then rescaled by 1 million for readability.


- *bls_data.csv*: full dataset of JOLTS rates used for estimation


- *GS1M.csv*: risk-free rate used for estimation
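
The anonymization steps listed above can be sketched roughly as follows. This is an illustration only, not the code used to produce the included files. The column names `JobId`, `JobDate` and `Sector`, the integer sector codes, the choice to assign exactly one category per job, and the reading of "rescaled by 1 million" as multiplication are all assumptions made for this example.

```python
import random
from datetime import date

CATEGORIES = ("AI", "OldTech", "DataMgmt")
N_SECTORS = 4  # four sectors, per the description; the integer codes are hypothetical


def make_pseudo_jobs(n_employers, jobs_per_employer, sample_dates, seed=0):
    """Return a list of dict rows that mimic the pseudo job-level data."""
    rng = random.Random(seed)
    # Randomly generated, unique employer identifiers.
    employer_ids = rng.sample(range(1_000_000), n_employers)
    rows = []
    for employer_id in employer_ids:
        for _ in range(jobs_per_employer):  # equal number of jobs per employer
            # Assumption: each job receives exactly one of the three categories.
            category = rng.choice(CATEGORIES)
            row = {
                "JobId": len(rows),  # continuous index standing in for BGTJobId
                "JobDate": rng.choice(sample_dates),  # random date from the sample
                "Sector": rng.randrange(N_SECTORS),  # random sector assignment
                "Employer_ID": employer_id,
            }
            for name in CATEGORIES:
                row[name] = int(name == category)
            rows.append(row)
    return rows


def make_pseudo_salaries(n_periods, seed=0):
    """Return one row per period with a Uniform[0, 1) draw times 1 million per job type."""
    rng = random.Random(seed)
    return [
        {name: rng.random() * 1_000_000 for name in CATEGORIES}
        for _ in range(n_periods)
    ]


if __name__ == "__main__":
    dates = [date(2015, 1, 1), date(2016, 1, 1), date(2017, 1, 1)]
    jobs = make_pseudo_jobs(n_employers=5, jobs_per_employer=10, sample_dates=dates)
    print(len(jobs), "pseudo jobs;", len(make_pseudo_salaries(12)), "salary periods")
```

Because every value is drawn at random, results obtained from the pseudo datasets are not expected to match the tables and figures in the paper; the files only let the code run end to end.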

## Replication code
The project runs in Python. To reproduce all main tables and figures, run *RFS_main.py*, which is located in the folder *RFS_Code* (this project's root directory, ROOT_DIR).

- Updating paths:
  - Before running *RFS_main.py*, manually set the variable *user_location* to the path that contains the data files.
  - The *user_location* variable is found in *RFS_settings.py* (in ROOT_DIR).  
  - The path should contain the pseudo data folder (*RFS_Pseudo_Data*).


- Required packages:
  - Most Python packages required to run this project are imported in *RFS_settings.py* (found in ROOT_DIR).
  - Additional packages are imported in *ROOT_DIR/RFS_Functions/Utilis.py*, *ROOT_DIR/RFS_Functions/Structural.py*, *ROOT_DIR/RFS_Functions/Lininterp.py*, and *ROOT_DIR/RFS_main.py*.
  - Scripts do not import additional packages; they only import the *RFS_settings.py* file and the relevant file(s) from *RFS_Functions*.
  - All required packages should be installed prior to running the code.


- Other requirements:
  - Python will create an intermediate outputs folder and a results folder in *user_location*. Hence, Python should be authorized to create folders in the chosen location.
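
Before a first run, the setup requirements above can be checked with a small script along these lines. This is a sketch, not part of the replication package: the function name `preflight` is hypothetical, and the default package names are placeholders, since the actual list must be read from the import statements in the files named above.

```python
import importlib.util
import tempfile
from pathlib import Path


def preflight(user_location, packages=("numpy", "pandas")):
    """Return a list of problems found; an empty list means the checks passed.

    `packages` holds placeholder names (an assumption); replace them with the
    top-level packages actually imported by the project files.
    """
    problems = []
    root = Path(user_location)

    # The chosen path should contain the pseudo data folder.
    data_dir = root / "RFS_Pseudo_Data"
    if not data_dir.is_dir():
        problems.append(f"missing folder: {data_dir}")

    # Python must be able to create folders in user_location.
    try:
        with tempfile.TemporaryDirectory(dir=root):
            pass  # the directory is created and removed immediately
    except OSError as exc:
        problems.append(f"cannot create folders in {root}: {exc}")

    # Every required package should be installed before running the code.
    for name in packages:
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not installed: {name}")

    return problems
```

Calling `preflight` with the same path assigned to *user_location* returns a list of messages describing anything that still needs fixing.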
